Inspecting hematopoietic development in the mouse embryo through a dataset of 3,934 cells “with blood-forming potential captured at four time points between E7.0 and E8.5” and their expression levels for 46 genes, Moignard et al. used unsupervised analysis techniques to reconstruct the “transcriptional programs that underpin organogenesis” (Moignard et al.). In particular, Moignard et al. employed hierarchical (agglomerative) clustering; dimensionality reduction in the form of multidimensional scaling (MDS), diffusion mapping, and principal components analysis (PCA); and network synthesis of boolean update rules to analyze the molecular pathways behind development. We extend their analysis with additional unsupervised techniques, including k-means clustering, nonlinear (kernel) PCA, locally linear embedding (LLE), and association rules, comparing each against a recreated PCA analysis to further test their conclusions.
Despite the presence of a response variable in the form of cell classification, we must use unsupervised learning techniques with the Moignard et al. gene expression data to infer the shape and properties of the probability density of gene expression (ESL 502). That density corresponds to the molecular pathways of hematopoietic cellular development over time. Because the gene expression data are high-dimensional, the curse of dimensionality implies that traditional attempts to estimate the density of gene expression will fail, necessitating dimensionality reduction and other techniques (ESL 502).
Dimensionality reduction techniques span from linear methods such as PCA to more recent nonlinear methods such as LLE. Nonlinear methods can be broadly classified into those that preserve global properties such as distance (e.g. the MDS and diffusion map methods used by Moignard et al.) or use nonlinear kernels to relate data points (e.g. kernel PCA) and those that preserve local properties such as LLE (van der Maaten et al.). While nonlinear dimensionality reduction methods perform well on artificial datasets, linear PCA has traditionally outperformed nonlinear methods on natural datasets, as the curse of dimensionality inhibits the performance of local nonlinear methods and algorithms face difficulty in finding extraordinarily small eigenvalues when solving the eigenproblem (van der Maaten et al.).
However, such comparisons have not yet been made for gene expression data. Moreover, while the gene expression data is high-dimensional, the number of samples is significantly larger than the number of predictors. The number of predictors is also less than fifty, and the Moignard et al. analysis suggests the existence of a three-dimensional manifold embedded in this higher-dimensional space which captures most of the dataset’s original variance (Moignard et al.). Thus, we recreate PCA analysis on the normalized gene expression dataset and compare it to various kernel PCAs and LLE.
Clustering techniques model the probability space as a mixture of multiple densities that represent distinct classes of data (ESL 502). Moignard et al. performed agglomerative (hierarchical) clustering on both the cell samples and the gene factors, finding three broad clusters. Cluster I is small, consisting primarily of PS and NP cells, with no expression of erythroid genes and low expression of endothelial genes, but high Cdh1. Cluster II is the largest, consisting of all non-4SG cells, with subclusters of endothelial and hemogenic endothelial cells. Cluster III is composed almost entirely of 4SG cells, which express hematopoietic genes and little to no endothelial genes (Sox, Hox, and Erg).
To supplement their analysis, we use k-means clustering, which randomly partitions the dataset into k clusters and then iterates by repeatedly calculating the centroid of each cluster and reassigning each sample to its nearest centroid. Using the within-sum-of-squares (WSS) metric and the GAP statistic, we can find the optimal number of clusters into which the gene expression data partition.
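As a concrete illustration, the centroid/assignment loop described above can be hand-rolled in a few lines of R on toy data. This is only a sketch under assumed toy inputs (two well-separated Gaussian blobs); the actual analysis below uses `stats::kmeans`, which runs this loop to convergence with multiple random restarts via `nstart`.

```r
# Hand-rolled sketch of the k-means (Lloyd) iteration on toy 2-D data
set.seed(1)
toy <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 4), ncol = 2))
cent <- toy[c(1, 30), ]                          # initial centroids, one per blob
for (it in 1:10) {
  d1 <- rowSums(sweep(toy, 2, cent[1, ])^2)      # squared distances to centroid 1
  d2 <- rowSums(sweep(toy, 2, cent[2, ])^2)      # squared distances to centroid 2
  assign <- ifelse(d1 < d2, 1, 2)                # reassign each point to its nearest centroid
  cent <- rbind(colMeans(toy[assign == 1, , drop = FALSE]),
                colMeans(toy[assign == 2, , drop = FALSE]))  # recompute centroids
}
wss <- sum(rowSums(sweep(toy, 2, cent[1, ])^2)[assign == 1]) +
       sum(rowSums(sweep(toy, 2, cent[2, ])^2)[assign == 2])  # within-sum-of-squares
```

Summing the squared distances of each point to its assigned centroid gives exactly the WSS quantity we later use to pick the number of clusters.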
Association rules analysis builds “conjunctive rules” from high-dimensional binary data in order to describe “regions of high density” (ESL 486). Certain algorithms that find association rules have been successfully applied to gene expression data in a supervised setting that seeks to profile phenotypes (Chen et al.). In fact, Moignard et al. create thresholds for binary discretization of their gene expression data for network synthesis via a boolean satisfiability problem (Moignard et al.). Their successful analysis suggests that applying association rules to the discretized binary data might find sets of “on” genes that cluster together and tend to imply other “on” genes. Unfortunately, association rules analysis on even small datasets such as the given binary dataset produces hundreds of thousands to tens of millions of rules. To mitigate this issue, we focus on a narrow problem: demonstrating the roles that Sox and Hox factors play in activating Erg, which controls HSC development.
The original PCA analysis demonstrated the existence of a three-dimensional manifold that captures most of the variance of the original high-dimensional data. Movement across the manifold represents development over time, while movement toward, and clustering upon, a particular portion of the manifold represents differentiation. In particular, the original PCA analysis showed that the 4SG cells were highly separable from the rest of the cells, which were more gradually and loosely clustered across the manifold.
The LLE analysis merely confirmed the original PCA analysis, showing the existence of a time gradient for gene expression without adding new insights. Because LLE collapsed distinct high-dimensional data points onto the same low-dimensional points, the resulting embedding lacked granularity. The kernel PCAs, on the other hand, provided additional granular views of the same high-dimensional gene expression data on a three-dimensional manifold. Each kernel PCA explained marginally more variance than the original PCA analysis but consumed more computational time and storage. Overall, this indicates that for certain forms of gene expression data involving developing and differentiating cells, vanilla PCA may provide the simplest and least intensive dimensionality reduction analysis of the original data.
The clustering analysis using the optimal number of clusters for the k-means algorithm supported the original hierarchical clustering, in which Moignard et al. claimed the existence of three overarching clusters, with some subclusters in the second and largest cluster. Inspecting the resulting centroids as representatives of their respective clusters revealed additional gradients for endothelial and hematopoietic development.
The association rules analysis demonstrated the viability of linking HoxB4 and Sox17 to the activation of Erg for HSC development. Moreover, the analysis confirmed already-known linkages, such as complexing between Gata1 and Gfi1b. However, the analysis produced tens of millions of rules with large lift, even when requiring high levels of support and confidence. Without sophisticated computing resources, such association rules analysis may not be manageable when the data have more than ten or so features. Still, subsetting the rules to find sets of “on” genes that imply a certain gene, or to find the implications of a certain gene being “on”, was more effective in demonstrating connections between transcription factors. Future research into association rules analysis, particularly in high-dimensional contexts, must be done before it can be used as a primary tool for unsupervised analysis.
Moignard et al.’s original salient conclusions appear to be correct, including the activation of the endothelial gene Erg by Sox17 and HoxB4 factors and the broad clustering of cells that mixes both cell type and types of genes expressed (endothelial vs. hematopoietic). Moreover, Moignard et al. utilized the optimal dimensionality reduction and rules-based analysis available in this unsupervised learning context, as the added complexity of nonlinear dimensionality reductions provides only marginal benefit, and association rules analysis currently lacks the tools and concepts necessary to make sense of implications in high-dimensional datasets. At the same time, the k-means clustering analysis provided a less complicated and more manageable approach to partitioning the cells into clusters, with clear implications drawn from expression levels of the centroids.
Let’s first take a look at the data provided by Moignard et al. in their paper “Decoding the regulatory network of early blood development from single-cell gene expression measurements”. To get a rough understanding of the values included, we’ll first create boxplots of both the raw and the normalized data.
## typesort
## sortA sortB
##  3175   759
ggplot(stack(Raw), aes(x = ind, y = values)) + geom_boxplot(aes(fill = ind))
ggplot(stack(Nor), aes(x = ind, y = values)) + geom_boxplot(aes(fill = ind))
Although the number of samples (cells studied) is greater than the number of predictors or features (transcription factors/genes studied), both the raw and normalized data are still too high-dimensional to be analyzed sensibly in an unsupervised setting without some form of dimensionality reduction.
dim(Raw)
## [1] 3934 46
dim(Nor)
## [1] 3934 46
From here, we will only be looking at the normalized data.
Our staple dimensionality reduction tool for continuous data is principal components analysis (PCA), which we perform first on the expression data. In short, PCA identifies new predictors or features (“principal components”) that are linear combinations of the original predictors or features (transcription factors/genes studied), such that each new principal component explains as much of the remaining variance in the data as possible while remaining orthogonal (perpendicular) to the previous principal components. Simply put, the original dataset shows data points scattered throughout some high-dimensional space, and PCA identifies lower-dimensional subspaces (implying linearity) that show the same data points with as much of the original scatter as possible.
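To make the mechanics concrete, here is a toy sketch of PCA by hand (assumed random toy data, not the expression matrix): the principal components are the eigenvectors of the covariance matrix, and projecting the centered data onto them recovers the scores that `prcomp` computes, up to sign flips of individual components.

```r
# PCA by hand on toy data: eigendecompose the covariance matrix and project
set.seed(2)
toy <- matrix(rnorm(40 * 4), nrow = 40)
Xc <- scale(toy, center = TRUE, scale = FALSE)   # center only, as in our analysis
e <- eigen(cov(Xc))                              # eigenvectors = principal axes
scores <- Xc %*% e$vectors                       # coordinates on the new axes
pr <- prcomp(toy, scale = FALSE)
all.equal(abs(scores), abs(pr$x), check.attributes = FALSE)  # should be TRUE (up to sign)
```

The eigenvalues of the covariance matrix are the variances along each principal axis, which is where the proportion-of-variance-explained computation below comes from.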
We perform PCA on the data and plot the first two and the first three principal components:
pr.nor <- prcomp(Nor, scale = FALSE)
plot(pr.nor$x[,1:2], col = cellcol, pch = 20, xlab = "PC1", ylab = "PC2")
scatterplot3d(pr.nor$x[,1:3], color = cellcol, pch = 20, xlab = "PC1", ylab = "PC2", zlab = "PC3", angle = 30)
How representative of the original dataset are our new principal components? We can take a look at the proportion of variance explained (PVE) by the principal components:
pve <- 100 * pr.nor$sdev^2/sum(pr.nor$sdev^2)
plot(pve, type = "o", ylab = "PVE", xlab = "Principal Component", col = "blue")
plot(cumsum(pve), type = "o", ylab = "Cumulative PVE", xlab = "Principal Component", col = "brown3")
In particular, the first three principal components have the following respective percentages of variance explained:
pve[1]
## [1] 33.79748
pve[2]
## [1] 23.55951
pve[3]
## [1] 4.829649
Based on the above graphs, the first two to three principal components provide the most efficient lower-dimensional projection of our high-dimensional dataset.
In this case, when we performed PCA on the dataset, we did not scale our expression levels for each gene. After all, expression levels for each gene are already in the same units, and relative amounts of expression qualitatively matter when considering the molecular pathways of blood cell development. Let’s validate this claim by performing PCA on the dataset after scaling each column of the dataset (expression values for each particular gene/transcription factor).
pr.norsc <- prcomp(Nor, scale = TRUE)
summary(pr.norsc)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 3.4332 2.8843 1.51999 1.32798 1.27183 1.03645
## Proportion of Variance 0.2562 0.1809 0.05023 0.03834 0.03516 0.02335
## Cumulative Proportion 0.2562 0.4371 0.48732 0.52565 0.56082 0.58417
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 1.02351 1.00810 0.99139 0.98162 0.95537 0.93766
## Proportion of Variance 0.02277 0.02209 0.02137 0.02095 0.01984 0.01911
## Cumulative Proportion 0.60694 0.62904 0.65040 0.67135 0.69119 0.71031
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.90173 0.86982 0.84576 0.83099 0.81490 0.80272
## Proportion of Variance 0.01768 0.01645 0.01555 0.01501 0.01444 0.01401
## Cumulative Proportion 0.72798 0.74443 0.75998 0.77499 0.78943 0.80344
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.78703 0.76955 0.74952 0.73427 0.69663 0.68136
## Proportion of Variance 0.01347 0.01287 0.01221 0.01172 0.01055 0.01009
## Cumulative Proportion 0.81690 0.82978 0.84199 0.85371 0.86426 0.87435
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.67085 0.66502 0.63076 0.62750 0.61089 0.58028
## Proportion of Variance 0.00978 0.00961 0.00865 0.00856 0.00811 0.00732
## Cumulative Proportion 0.88413 0.89375 0.90240 0.91096 0.91907 0.92639
## PC31 PC32 PC33 PC34 PC35 PC36
## Standard deviation 0.57403 0.55970 0.54769 0.53138 0.50888 0.50016
## Proportion of Variance 0.00716 0.00681 0.00652 0.00614 0.00563 0.00544
## Cumulative Proportion 0.93355 0.94036 0.94689 0.95302 0.95865 0.96409
## PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 0.48054 0.46831 0.45562 0.44675 0.43683 0.42257
## Proportion of Variance 0.00502 0.00477 0.00451 0.00434 0.00415 0.00388
## Cumulative Proportion 0.96911 0.97388 0.97839 0.98273 0.98688 0.99076
## PC43 PC44 PC45 PC46
## Standard deviation 0.40655 0.36611 0.35454 5.184e-10
## Proportion of Variance 0.00359 0.00291 0.00273 0.000e+00
## Cumulative Proportion 0.99435 0.99727 1.00000 1.000e+00
In this case, the first three principal components still explain more than 45 percent of the original dataset’s variance, implying that even under scaling, PCA can find two to three orthogonal directions through the dataset that capture nearly half of the variance in the high-dimensional dataset.
Since we know each cell sample’s original cell type, we can actually create a biplot that not only plots each cell sample and each transcription factor by the first two principal components, but also colors and “clusters” each cell sample by cell type:
pp <- ggbiplot(pr.nor, obs.scale = 1, var.scale = 1, scale = 1, groups = celltypes, ellipse = TRUE, circle = TRUE)
pp
The above biplot singlehandedly provides extensive orientation regarding gene expression patterns across the five cell types. In particular, the directions of expression for each gene match predictions made before and in Moignard et al. For instance, expression of E-cadherin (Cdh1) is restricted to the primitive streak (PS) cells, expression of endothelial genes (such as Sox, Hox, Erg, and Cdh5 factors) is restricted to the non-hematopoietic cell types (and likely progenitors in the form of HF and NP cells), and expression of hematopoietic genes (such as Hbb-bH1, Ikaros, Gata1, and Gfi1b) is restricted to 4SG hematopoietic cells. However, the biplot does not provide any additional information regarding which genes activate others during differentiation and over time.
We first attempt to build an LLE of our gene expression data using ten neighbors. In other words, for each particular high-dimensional sample (cell), the LLE algorithm finds the k closest neighbors and reconstructs the high-dimensional position of the sample as a linear combination of the actual high-dimensional positions of those neighbors. Then, the LLE algorithm finds a lower-dimensional space onto which it projects the samples while trying to preserve the weights of those linear combinations.
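The weight-finding step can be sketched in base R for a single point (toy data, assumed parameters; the `lle` package solves this for every point and then finds the embedding that best preserves the weights): center the k neighbors on the point, form the local covariance, and solve for weights constrained to sum to one.

```r
# Sketch of LLE's reconstruction-weight step for one point on toy data
set.seed(3)
X <- matrix(rnorm(20 * 5), nrow = 20)
x <- X[1, ]                                     # the point to reconstruct
k <- 4
d2 <- rowSums(sweep(X, 2, x)^2)                 # squared distances to x
nb <- order(d2)[2:(k + 1)]                      # k nearest neighbours (skip x itself)
Z <- sweep(X[nb, ], 2, x)                       # neighbours centred on x
C <- Z %*% t(Z) + diag(k) * 1e-8                # regularised local covariance
w <- solve(C, rep(1, k))
w <- w / sum(w)                                 # reconstruction weights, summing to one
sum(w)
```

These per-point weights are what the lower-dimensional embedding then tries to preserve.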
Here, we try LLE with the ten closest neighbors. We are interested in finding an LLE in three dimensions, as our PCA plots demonstrated the presence of a manifold in three dimensions.
l <- lle(X = Norm, m = 3, k = 10)
## finding neighbours
## calculating weights
## computing coordinates
scatterplot3d(l$Y, color = cellcol, pch = 20, xlab = "PC1", ylab = "PC2", zlab = "PC3", angle = 20)
The plot is not smooth because LLE tends to collapse data points in very high dimensions to single points in the specified lower dimensions (van der Maaten et al.). While the plot does demonstrate a highly linear gradient and preserves the same separation of 4SG cells, the inability to inspect the embedding at a granular level means that the LLE plot merely confirms the existence of the manifold without providing any new insights.
First, we use a polynomial kernel in an attempt to qualitatively improve the amount of variance explained by the principal components. Using the “kernlab” package, we first try PCA using polynomials of degree two and plot the first two principal components:
kp.nor <- kpca(Norm, kernel = "polydot", kpar = list(degree = 2))
plot(pcv(kp.nor)[,1:2], col = cellcol, pch = 20, xlab = "PC1", ylab = "PC2")
Next, we plot the first three principal components:
scatterplot3d(pcv(kp.nor)[,1:3], color = cellcol, pch = 20, xlab = "PC1", ylab = "PC2", zlab = "PC3", angle = 30)
scatterplot3d(pcv(kp.nor)[,1:3], color = cellcol, pch = 20, xlab = "PC1", ylab = "PC2", zlab = "PC3", angle = -150)
The following is a captured image of an interactive plot made using “plot3d” of the first three principal components:
Graphically, the three-dimensional manifold mapped by the first three principal components of a degree-two polynomial kernel PCA appears “tighter” or more well-defined than the one mapped by the first three PCs of a linear PCA. At the same time, the PVE suggests that this PCA does not analytically “improve” the amount of detail captured in the lower-dimensional subspace:
pve <- 100 * eig(kp.nor)/sum(eig(kp.nor))
pve[1:3]/sum(pve)
## Comp.1 Comp.2 Comp.3
## 0.32651700 0.23380464 0.05321668
par(mfrow = c(1, 2))
plot(pve, type = "o", ylab = "PVE", xlab = "Principal Component", col = "blue")
plot(cumsum(pve), type = "o", ylab = "Cumulative PVE", xlab = "Principal Component", col = "brown3")
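Under the hood, kernel PCA eigendecomposes a double-centered Gram matrix. A toy sketch (assuming kernlab’s polydot defaults of scale = 1 and offset = 1, so the kernel is (x·y + 1)^2; toy random data, not the expression matrix):

```r
# Kernel PCA by hand: Gram matrix, double-centring, eigendecomposition
set.seed(4)
X <- matrix(rnorm(20 * 3), nrow = 20)
K <- (X %*% t(X) + 1)^2                         # degree-2 polynomial kernel matrix
n <- nrow(K)
H <- diag(n) - matrix(1/n, n, n)                # centring matrix
Kc <- H %*% K %*% H                             # centre the kernel in feature space
ev <- eigen(Kc, symmetric = TRUE)$values
pve <- ev / sum(ev)                             # proportion of variance per kernel PC
round(pve[1:3], 3)
```

The eigenvalue proportions here play the same role as the PVE computed from `eig()` above.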
Next, we try PCA using polynomials of degree three and plot the first two principal components:
kp.nordeg3 <- kpca(Norm, kernel = "polydot", kpar = list(degree = 3))
plot(pcv(kp.nordeg3)[,1:2], col = cellcol, pch = 20, xlab = "PC1", ylab = "PC2")
Next, we plot the first three principal components:
scatterplot3d(pcv(kp.nordeg3)[,1:3], color = cellcol, pch = 20, xlab = "PC1", ylab = "PC2", zlab = "PC3", angle = -250)
scatterplot3d(pcv(kp.nordeg3)[,1:3], color = cellcol, pch = 20, xlab = "PC1", ylab = "PC2", zlab = "PC3", angle = 300)
Visually, the displayed manifold is a single, smooth curve that emphasizes the separation of the 4SG cells at the end of this development cycle. Approximately 60% of the original variance is explained by the first three principal components.
pve <- 100 * eig(kp.nordeg3)/sum(eig(kp.nordeg3))
pve[1:3]/sum(pve)
## Comp.1 Comp.2 Comp.3
## 0.35284291 0.19460562 0.05832961
par(mfrow = c(1, 2))
plot(pve, type = "o", ylab = "PVE", xlab = "Principal Component", col = "blue")
plot(cumsum(pve), type = "o", ylab = "Cumulative PVE", xlab = "Principal Component", col = "brown3")
Finally, we use a radial basis kernel for PCA. We only compute the first three principal components to save computation and storage, and then plot the first two principal components:
kp.norrbf <- kpca(Norm, kernel = "rbfdot", kpar = list(sigma = 0.0005), features = 3)
plot(pcv(kp.norrbf), col = cellcol, pch = 20, xlab = "PC1", ylab = "PC2")
Next, we plot the first three principal components:
scatterplot3d(pcv(kp.norrbf), color = cellcol, pch = 20, xlab = "PC1", ylab = "PC2", zlab = "PC3", angle = -20)
Each of these kernels demonstrates the validity of the original PCA plots, in which a three-dimensional manifold was observed. With all three types of kernels, we observed the same clear separation between 4SG (red) cells and the rest of the cells, in addition to a gradient involving the other four types of cells. We must identify clusters and subclusters of cells within the larger gradient to gain a more granular understanding of the dataset.
First, we need to determine the optimal number of clusters in the dataset. We plot the WSS (within groups sum of squares) corresponding to the results of kmeans performed for cluster sizes 2 to 15:
wss <- numeric(15)
for (i in 2:15)
wss[i] <- sum(kmeans(Nor, centers = i)$withinss)
plot(2:15, wss[2:15], type = "b", xlab = "Number of clusters", ylab = "WSS")
The “elbow” appears at around seven clusters, so the optimal number of clusters according to this informal graphical method is around 7.
We can also use the GAP statistic to find the optimal number of clusters (Tibshirani et al.):
cGap <- clusGap(Nor, kmeans, 15, B = 100, verbose = interactive())
cGap
## Clustering Gap statistic ["clusGap"].
## B=100 simulated reference sets, k = 1..15
## --> Number of clusters (method 'firstSEmax', SE.factor=1): 7
## logW E.logW gap SE.sim
## [1,] 10.69050 11.19288 0.5023754 0.001256963
## [2,] 10.57500 11.14463 0.5696355 0.004323713
## [3,] 10.37354 11.11579 0.7422496 0.001082916
## [4,] 10.31871 11.09632 0.7776103 0.001187192
## [5,] 10.27345 11.08967 0.8162187 0.001255677
## [6,] 10.25050 11.08363 0.8331341 0.001236675
## [7,] 10.23731 11.07810 0.8407954 0.001209351
## [8,] 10.23378 11.07318 0.8393961 0.001336545
## [9,] 10.22139 11.06867 0.8472804 0.001188872
## [10,] 10.21152 11.06481 0.8532941 0.001116645
## [11,] 10.20472 11.06107 0.8563493 0.001227750
## [12,] 10.19530 11.05769 0.8623952 0.001216860
## [13,] 10.19294 11.05475 0.8618066 0.001234045
## [14,] 10.18245 11.05186 0.8694033 0.001176497
## [15,] 10.17301 11.04917 0.8761666 0.001126556
The GAP statistic likewise selects 7 as the optimal number of clusters.
Thus, we run the kmeans clustering algorithm on the data to find 7 clusters and inspect the centroids of the clusters (using a garish gradient from black to pink so each cluster looks distinct):
km.out <- kmeans(Nor, centers = 7, nstart = 20)
prs <- data.frame(pr.nor$x)
centroids <- km.out$centers
ggplot(prs, aes(x = PC1, y = PC2, colour = km.out$cluster, shape = celltypes)) + geom_point() + scale_colour_gradient(low="black", high="pink")
We can then compare expression levels of certain genes across clusters by inspecting the centroids. First, we take a look at endothelial genes, such as Erg, Sox7, Sox17, HoxB4, and Cdh5.
centroids[,c("Erg", "Sox7", "Sox17", "HoxB4", "Cdh5")]
## Erg Sox7 Sox17 HoxB4 Cdh5
## 1 -5.657702 -4.392338 -12.689546 -7.214614 0.4037315
## 2 -13.428199 -12.336337 -12.925395 -10.517164 -13.5222979
## 3 -2.167968 -1.781390 -3.584136 -5.778222 3.6608800
## 4 -13.913095 -12.443622 -7.587657 -13.331000 -13.9137016
## 5 -13.936139 -13.778266 -13.990680 -13.913158 -13.4762738
## 6 -5.515506 -1.696637 -5.996344 -7.278214 0.4924869
## 7 -11.050919 -3.317122 -9.406439 -11.287587 -9.4587777
The cluster with the lowest expression of endothelial genes across the board represents the highly separated 4SG cells, or Cluster III from the Moignard et al. analysis. The cluster with the highest expression of endothelial genes across the board represents mostly PS and NP cells, or Cluster I from the Moignard et al. analysis. Graphically, as we move away from Cluster I (in either direction on the manifold), the endothelial genes become less and less expressed.
Second, we take a look at hematopoietic genes, such as Hbb-bH1, Gata1, Nfe2, Gfi1b, Ikaros, and Myb.
centroids[,c("HbbbH1", "Gata1", "Nfe2", "Gfi1b", "Ikaros", "Myb")]
## HbbbH1 Gata1 Nfe2 Gfi1b Ikaros Myb
## 1 -6.956291 -7.458188 -6.647620 -3.439809 -3.095237 -2.486049
## 2 -11.927792 -13.638337 -13.510728 -13.580726 -13.725458 -12.432889
## 3 -11.239576 -13.838621 -12.508002 -12.192791 -11.339907 -12.250488
## 4 -11.719080 -12.825515 -13.512701 -13.485849 -13.765449 -12.985882
## 5 6.599948 -2.732422 -1.671084 -1.748268 -2.725703 -4.126775
## 6 -11.731799 -13.645862 -12.745990 -12.862657 -10.060498 -9.166072
## 7 -11.725466 -13.656746 -13.370423 -13.309723 -13.147767 -11.074483
Here, we see clear patterns of differentiation. The cluster with the highest expression of hematopoietic genes across the board represents the highly separated 4SG cells, or Cluster III from the Moignard et al. analysis. Its neighboring cluster (on the manifold) also expresses high levels of these hematopoietic genes. On the other hand, all other clusters show little to no expression of these genes, representing the other “arm” of the manifold. This indicates that movement along the bottom “arm” sends cells towards a hematopoietic fate.
Thus, the k-means clustering corroborates the agglomerative clustering performed in Moignard et al. Below are visualizations of the clusters in three-dimensional space.
Here, we perform our association rules analysis using the “arules” package. We first load the binary data and convert it into transactional form. In other words, for each row or cell, the transaction consists of all of those genes that are expressed or “on.”
Norb <- read.csv("~/Downloads/nbt.3154-S6-binary.csv", row.names = 1)
Norbm <- as.matrix(Norb)
Norbt <- as(Norbm, "transactions")
summary(Norbt)
## transactions as itemMatrix in sparse format with
## 1448 rows (elements/itemsets/transactions) and
## 33 columns (items) and a density of 0.6188892
##
## most frequent items:
## Ets2 FoxO4 FoxH1 Ldb1 Tal1 (Other)
## 1442 1442 1438 1433 1400 22418
##
## element (itemset/transaction) length distribution:
## sizes
## 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 1 6 9 23 33 52 74 88 96 98 104 124 112 140 123 104 96 74
## 27 28 29 30 31
## 50 24 12 4 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 17.00 21.00 20.42 24.00 31.00
##
## includes extended item information - examples:
## labels
## 1 Cbfa2t3h
## 2 Erg
## 3 Ets1
##
## includes extended transaction information - examples:
## transactionID
## 1 HFB9_177
## 2 HFB12_136
## 3 NPA6_034
The following is a frequency of expression for each gene across the entire “transaction database,” or the cells:
itemFrequencyPlot(Norbt, support = 0.3, cex.names = 0.5)
To understand the role that Hox and Sox factors play in controlling Erg, we must first find all the rules for which Erg is the implication, or the sets of “on” genes that indicate that Erg is “on” as well. Generally, a rule relates a set of genes, denoted X, to a separate set of genes, denoted Y, that it implies. Note the quality metrics of support, confidence, and lift. The support of X is the frequency with which X appears in the transaction database. The confidence of a rule is the proportion of transactions containing X that also contain Y. Finally, the lift of a rule is the ratio of the observed support to that expected if X and Y were independent. Thus, we mine our rules while requiring high levels of support and confidence:
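These three quality measures can be computed by hand on a toy binary matrix (a sketch of the definitions only, not of the apriori search itself; the two-column matrix is an assumed toy example):

```r
# Support, confidence, and lift of the rule X => Y on a toy transaction matrix
m <- matrix(c(1, 1,
              1, 1,
              1, 0,
              0, 1,
              1, 1),
            ncol = 2, byrow = TRUE, dimnames = list(NULL, c("X", "Y")))
suppX  <- mean(m[, "X"] == 1)                   # frequency of X in the database
suppXY <- mean(m[, "X"] == 1 & m[, "Y"] == 1)   # support of the rule X => Y
conf   <- suppXY / suppX                        # confidence: P(Y | X)
lift   <- conf / mean(m[, "Y"] == 1)            # confidence relative to independence
c(support = suppXY, confidence = conf, lift = lift)
```

A lift above 1 means X and Y co-occur more often than independence would predict, which is why we sort the mined Erg rules by lift below.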
r <- apriori(Norbt, control = list(verbose = F), parameter = list(minlen=2, supp = 0.2, conf = 0.7), appearance = list(default = "lhs", rhs = c("Erg")))
summary(r)
## set of 573623 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7 8 9 10
## 6 157 1426 7380 25516 63575 118603 169391 187569
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 8.0 9.0 8.7 10.0 10.0
##
## summary of quality measures:
## support confidence lift
## Min. :0.2003 Min. :0.7000 Min. :1.247
## 1st Qu.:0.2196 1st Qu.:0.8232 1st Qu.:1.466
## Median :0.2507 Median :0.8571 Median :1.527
## Mean :0.2712 Mean :0.8490 Mean :1.512
## 3rd Qu.:0.3018 3rd Qu.:0.8820 3rd Qu.:1.571
## Max. :0.5580 Max. :0.9327 Max. :1.661
##
## mining info:
## data ntransactions support confidence
## Norbt 1448 0.2 0.7
The algorithm orders the developed rules by length. Thus, we can inspect the first ten rules and see the connections between Mecom, Sox17, HoxB4, Meis1, Sox7, Notch1, and Erg:
inspect(r[1:10])
## lhs rhs support confidence lift
## 1 {Mecom} => {Erg} 0.2389503 0.8693467 1.548357
## 2 {Sox17} => {Erg} 0.3466851 0.8408710 1.497640
## 3 {HoxB4} => {Erg} 0.3743094 0.8262195 1.471545
## 4 {Meis1} => {Erg} 0.5138122 0.7315634 1.302957
## 5 {Sox7} => {Erg} 0.5379834 0.7728175 1.376433
## 6 {Notch1} => {Erg} 0.5448895 0.7485769 1.333259
## 7 {Mecom,
## Meis1} => {Erg} 0.2313536 0.8862434 1.578451
## 8 {Mecom,
## Sox7} => {Erg} 0.2389503 0.8987013 1.600639
## 9 {Mecom,
## Notch1} => {Erg} 0.2375691 0.8911917 1.587264
## 10 {Etv2,
## Mecom} => {Erg} 0.2147790 0.8687151 1.547232
We can also sort the rules by decreasing size of lift:
r.sorted <- sort(r, by = "lift")
inspect(r.sorted[1:10])
## lhs rhs support confidence lift
## 1 {Cbfa2t3h,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
## 2 {Cbfa2t3h,
## Ets1,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
## 3 {Cbfa2t3h,
## Fli1,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
## 4 {Cbfa2t3h,
## Etv6,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
## 5 {Cbfa2t3h,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1,
## Tal1} => {Erg} 0.2009669 0.9326923 1.661179
## 6 {Cbfa2t3h,
## FoxH1,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
## 7 {Cbfa2t3h,
## FoxO4,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
## 8 {Cbfa2t3h,
## Ets2,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
## 9 {Cbfa2t3h,
## Ets1,
## Fli1,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
## 10 {Cbfa2t3h,
## Ets1,
## Etv6,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
To see the influence of Hox and Sox factors on Erg, we subset our set of Erg rules to find those including any of HoxB2, HoxB4, HoxD8, Sox17, and Sox7.
r.hoxSox <- subset(r, subset = (lhs %in% c("HoxB2", "HoxB4", "HoxD8", "Sox17", "Sox7")))
summary(r.hoxSox)
## set of 340898 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7 8 9 10
## 3 61 565 3172 12046 32828 66584 102875 122764
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 8.00 9.00 8.82 10.00 10.00
##
## summary of quality measures:
## support confidence lift
## Min. :0.2003 Min. :0.7000 Min. :1.247
## 1st Qu.:0.2196 1st Qu.:0.8444 1st Qu.:1.504
## Median :0.2500 Median :0.8689 Median :1.548
## Mean :0.2677 Mean :0.8620 Mean :1.535
## 3rd Qu.:0.2983 3rd Qu.:0.8873 3rd Qu.:1.580
## Max. :0.5380 Max. :0.9327 Max. :1.661
##
## mining info:
## data ntransactions support confidence
## Norbt 1448 0.2 0.7
inspect(r.hoxSox[1:10])
## lhs rhs support confidence lift
## 1 {Sox17} => {Erg} 0.3466851 0.8408710 1.497640
## 2 {HoxB4} => {Erg} 0.3743094 0.8262195 1.471545
## 3 {Sox7} => {Erg} 0.5379834 0.7728175 1.376433
## 4 {Mecom,
## Sox7} => {Erg} 0.2389503 0.8987013 1.600639
## 5 {HoxB4,
## Sox17} => {Erg} 0.2368785 0.8863049 1.578560
## 6 {Lmo2,
## Sox17} => {Erg} 0.2265193 0.8453608 1.505636
## 7 {Sox17,
## Tbx20} => {Erg} 0.2686464 0.8155136 1.452477
## 8 {Meis1,
## Sox17} => {Erg} 0.3301105 0.8675136 1.545092
## 9 {Sox17,
## Sox7} => {Erg} 0.3466851 0.8451178 1.505204
## 10 {Notch1,
## Sox17} => {Erg} 0.3459945 0.8462838 1.507280
We also examine the complementary set: all rules whose left-hand sides include none of these Hox or Sox factors:
r.notHoxSox <- subset(r, subset = !(lhs %in% c("HoxB2", "HoxB4", "HoxD8", "Sox17", "Sox7")))
summary(r.notHoxSox)
## set of 232725 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7 8 9 10
## 3 96 861 4208 13470 30747 52019 66516 64805
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 8.000 9.000 8.524 10.000 10.000
##
## summary of quality measures:
## support confidence lift
## Min. :0.2003 Min. :0.7000 Min. :1.247
## 1st Qu.:0.2189 1st Qu.:0.8005 1st Qu.:1.426
## Median :0.2521 Median :0.8338 Median :1.485
## Mean :0.2762 Mean :0.8300 Mean :1.478
## 3rd Qu.:0.3073 3rd Qu.:0.8644 3rd Qu.:1.539
## Max. :0.5580 Max. :0.9135 Max. :1.627
##
## mining info:
## data ntransactions support confidence
## Norbt 1448 0.2 0.7
Clearly, while the majority of Erg rules involve some Hox or Sox factor, the quality measures of the rules that involve them are nearly identical to those of the rules that do not. We therefore take a more refined approach and examine the subset of rules involving HoxB4 or Sox17, which Moignard et al. identified as key determinants of Erg.
r.refined <- subset(r, subset = (lhs %in% c("HoxB4", "Sox17")))
summary(r.refined)
## set of 179071 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7 8 9 10
## 2 39 349 1899 6992 18431 36045 53613 61701
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 8.000 9.000 8.767 10.000 10.000
##
## summary of quality measures:
## support confidence lift
## Min. :0.2003 Min. :0.7919 Min. :1.410
## 1st Qu.:0.2169 1st Qu.:0.8609 1st Qu.:1.533
## Median :0.2465 Median :0.8762 Median :1.561
## Mean :0.2557 Mean :0.8760 Mean :1.560
## 3rd Qu.:0.2907 3rd Qu.:0.8908 3rd Qu.:1.587
## Max. :0.3743 Max. :0.9327 Max. :1.661
##
## mining info:
## data ntransactions support confidence
## Norbt 1448 0.2 0.7
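Because every rule in these subsets has {Erg} as its right-hand side, the quality measures are internally linked: lift is confidence divided by the support of the right-hand side, so confidence/lift should recover support(Erg) identically for every rule. A quick check using the `quality()` accessor from arules (assuming `r.refined` as constructed above):

```r
# For rules sharing the rhs {Erg}: lift = confidence / support(Erg),
# so confidence / lift is the same number for every rule.
q <- quality(r.refined)
summary(q$confidence / q$lift)  # essentially constant, ~0.561 = support of {Erg}
```

This also gives us the marginal frequency of Erg expression in the transaction data without mining any further rules.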
Note the rise in confidence and lift in this refined subset compared to the previous two: the minimum confidence climbs from 0.70 to 0.79 and the minimum lift from 1.25 to 1.41. We inspect the first few rules of this refined subset, then sort it by lift and inspect it again:
inspect(r.refined[1:10])
## lhs rhs support confidence lift
## 1 {Sox17} => {Erg} 0.3466851 0.8408710 1.497640
## 2 {HoxB4} => {Erg} 0.3743094 0.8262195 1.471545
## 3 {HoxB4,
## Sox17} => {Erg} 0.2368785 0.8863049 1.578560
## 4 {Lmo2,
## Sox17} => {Erg} 0.2265193 0.8453608 1.505636
## 5 {Sox17,
## Tbx20} => {Erg} 0.2686464 0.8155136 1.452477
## 6 {Meis1,
## Sox17} => {Erg} 0.3301105 0.8675136 1.545092
## 7 {Sox17,
## Sox7} => {Erg} 0.3466851 0.8451178 1.505204
## 8 {Notch1,
## Sox17} => {Erg} 0.3459945 0.8462838 1.507280
## 9 {Etv2,
## Sox17} => {Erg} 0.3197514 0.8297491 1.477831
## 10 {Hhex,
## Sox17} => {Erg} 0.3397790 0.8424658 1.500480
r.refined.sorted <- sort(r.refined, by = "lift")
inspect(r.refined.sorted[1:10])
## lhs rhs support confidence lift
## 1 {Cbfa2t3h,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
## 2 {Cbfa2t3h,
## Ets1,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
## 3 {Cbfa2t3h,
## Fli1,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
## 4 {Cbfa2t3h,
## Etv6,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
## 5 {Cbfa2t3h,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1,
## Tal1} => {Erg} 0.2009669 0.9326923 1.661179
## 6 {Cbfa2t3h,
## FoxH1,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
## 7 {Cbfa2t3h,
## FoxO4,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
## 8 {Cbfa2t3h,
## Ets2,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
## 9 {Cbfa2t3h,
## Ets1,
## Fli1,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
## 10 {Cbfa2t3h,
## Ets1,
## Etv6,
## Hhex,
## HoxB4,
## Ikaros,
## Lyl1,
## Notch1} => {Erg} 0.2009669 0.9326923 1.661179
In each of the ten rules, HoxB4 appears with Lyl1 and Notch1, consistent with what is known developmentally about the activation of Erg via HoxB4 (Moignard et al.). We can visualize this by plotting the 25 highest-lift rules drawn from all Hox and Sox factors alongside the 25 highest-lift rules drawn from HoxB4 and Sox17 alone:
r.hoxSox.sorted <- sort(r.hoxSox, by = "lift")
plot(r.hoxSox.sorted[1:25], method = "graph")
plot(r.refined.sorted[1:25], method = "graph")
We can invert this analysis and ask what the presence of Erg implies; Erg now becomes our left-hand side. We lower the support and confidence thresholds to find more rules, then inspect and graph the results:
rl <- apriori(Norbt, control = list(verbose = F),
              parameter = list(minlen = 2, supp = 0.1, conf = 0.5),
              appearance = list(default = "rhs", lhs = c("Erg")))
summary(rl)
## set of 23 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 23
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence lift
## Min. :0.2873 Min. :0.5117 Min. :0.9088
## 1st Qu.:0.3999 1st Qu.:0.7122 1st Qu.:1.0017
## Median :0.5311 Median :0.9459 Median :1.1094
## Mean :0.4832 Mean :0.8606 Mean :1.1545
## 3rd Qu.:0.5566 3rd Qu.:0.9914 3rd Qu.:1.2420
## Max. :0.5615 Max. :1.0000 Max. :1.4976
##
## mining info:
## data ntransactions support confidence
## Norbt 1448 0.1 0.5
inspect(rl)
## lhs rhs support confidence lift
## 1 {Erg} => {Sox17} 0.3466851 0.6174662 1.4976399
## 2 {Erg} => {HoxB4} 0.3743094 0.6666667 1.4715447
## 3 {Erg} => {Sfpi1} 0.3425414 0.6100861 1.2389967
## 4 {Erg} => {Myb} 0.2872928 0.5116851 0.9319749
## 5 {Erg} => {Ikaros} 0.2914365 0.5190652 0.9088348
## 6 {Erg} => {Lmo2} 0.3736188 0.6654367 1.2104928
## 7 {Erg} => {Tbx20} 0.4254144 0.7576876 1.2231122
## 8 {Erg} => {Meis1} 0.5138122 0.9151292 1.3029567
## 9 {Erg} => {Sox7} 0.5379834 0.9581796 1.3764326
## 10 {Erg} => {Notch1} 0.5448895 0.9704797 1.3332586
## 11 {Erg} => {Etv2} 0.5214088 0.9286593 1.2450913
## 12 {Erg} => {Hhex} 0.5469613 0.9741697 1.2255411
## 13 {Erg} => {Ets1} 0.5580110 0.9938499 1.2052720
## 14 {Erg} => {Runx1} 0.4779006 0.8511685 0.9971618
## 15 {Erg} => {Lyl1} 0.5283149 0.9409594 1.0996846
## 16 {Erg} => {Cbfa2t3h} 0.5310773 0.9458795 1.1001072
## 17 {Erg} => {Fli1} 0.5600829 0.9975400 1.1093993
## 18 {Erg} => {Etv6} 0.5614641 1.0000000 1.0454874
## 19 {Erg} => {Tal1} 0.5614641 1.0000000 1.0342857
## 20 {Erg} => {Ldb1} 0.5552486 0.9889299 0.9992816
## 21 {Erg} => {FoxH1} 0.5545580 0.9876999 0.9945684
## 22 {Erg} => {FoxO4} 0.5580110 0.9938499 0.9979852
## 23 {Erg} => {Ets2} 0.5614641 1.0000000 1.0041609
plot(rl, method = "graph")
While the rules implying Sox17 and HoxB4 do not have the highest support (those pairs occur less often in the transaction data), they have by far the highest lift, and lift is symmetric: a cell expressing Erg is markedly more likely to also express Sox17 (and likewise HoxB4), just as the earlier rules showed that cells expressing Sox17 tend to express Erg. The inspection also demonstrates a tight relationship with Sox7. No other gene comes close.
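As a sanity check on this symmetry, the reported quantities tie together: lift({Erg} => {Sox17}) equals P(Erg and Sox17) / (P(Erg) x P(Sox17)), and each of those probabilities can be reassembled from values already printed above (a quick arithmetic check in base R):

```r
supp_both  <- 0.3466851              # support of {Erg, Sox17} (rule 1 above)
supp_erg   <- 0.5614641              # support of {Erg} (from the rules with confidence 1)
supp_sox17 <- supp_both / 0.8408710  # conf(Sox17 => Erg) = supp_both / supp_sox17
supp_both / (supp_erg * supp_sox17)  # ~1.4976, the lift reported in both directions
```

The result matches the lift of 1.4976 reported for both {Sox17} => {Erg} and {Erg} => {Sox17}, confirming that lift carries no directional information; direction enters only through confidence.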
We can also mine all possible rules from our transaction data, subject only to constraints on support and confidence:
rules <- apriori(Norbt, parameter = list(support = 0.2, confidence = 0.6))
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.6 0.1 1 none FALSE TRUE 0.2 1 10
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[32 item(s), 1448 transaction(s)] done [0.00s].
## sorting and recoding items ... [30 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [7.54s].
## writing ... [14516733 rule(s)] done [2.36s].
## creating S4 object ... done [7.00s].
summary(rules)
## set of 14516733 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5 6 7 8 9
## 17 524 6555 46443 213911 693724 1660239 3018109 4238681
## 10
## 4638530
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 8.000 9.000 8.662 10.000 10.000
##
## summary of quality measures:
## support confidence lift
## Min. :0.2003 Min. :0.6000 Min. :0.7379
## 1st Qu.:0.2196 1st Qu.:0.9052 1st Qu.:1.0189
## Median :0.2514 Median :0.9812 Median :1.1240
## Mean :0.2752 Mean :0.9301 Mean :1.1740
## 3rd Qu.:0.3052 3rd Qu.:0.9981 3rd Qu.:1.2775
## Max. :0.9959 Max. :1.0000 Max. :2.3498
##
## mining info:
## data ntransactions support confidence
## Norbt 1448 0.2 0.6
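At these thresholds the miner emits over 14.5 million rules, and the run above spends most of its time writing and wrapping them into an S4 object. When only the strongest short rules are of interest, one sketch is to cap the rule length and tighten confidence (`maxlen` is a standard `apriori()` parameter in arules; the threshold values here are illustrative):

```r
# A much smaller rule set: short rules only, at a higher confidence bar
rules.small <- apriori(Norbt,
                       parameter = list(support = 0.2, confidence = 0.9,
                                        maxlen = 4),
                       control = list(verbose = FALSE))
```

Long rules in this data are largely redundant elaborations of shorter ones (as the identical quality measures of the top-lift rules above suggest), so the cap costs little information.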
We sort our rules by lift and graph the first thousand in a grouped matrix:
rules.sorted <- sort(rules, by = "lift")
plot(rules.sorted[1:1000], method = "grouped")
The graph shows a tight relationship between Gata1 and Gfi1b, which represents the fact that the two factors complex together “to control expression of genes involved in the development and maturation of erythrocytes and megakaryocytes” (Osawa et al.).
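The pairwise Gata1/Gfi1b association can also be checked directly, without mining rules at all; a sketch using the `crossTable()` helper from arules (assuming a version of the package whose `crossTable()` accepts a `measure` argument, and our `Norbt` transactions):

```r
# Pairwise lift between all items; the Gata1/Gfi1b entry should stand out
ct <- crossTable(Norbt, measure = "lift")
ct["Gata1", "Gfi1b"]
```

A pairwise lift well above 1 for this entry would corroborate the grouped-matrix plot.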
Chen, Shu-Chuan, Tsung-Hsien Tsai, Cheng-Han Chung, and Wen-Hsiung Li. “Dynamic Association Rules for Gene Expression Data Analysis.” BMC Genomics 16, no. 1 (2015): 786. doi:10.1186/s12864-015-1970-x.
Hastie, Trevor, Robert Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer Series in Statistics. New York, NY: Springer, 2009.
van der Maaten, L.J.P., E.O. Postma, and H.J. van den Herik. Dimensionality reduction: A comparative review. Online Preprint, 2008.
Moignard, Victoria, Steven Woodhouse, Laleh Haghverdi, Andrew J Lilly, Yosuke Tanaka, Adam C Wilkinson, Florian Buettner, et al. “Decoding the Regulatory Network of Early Blood Development from Single-Cell Gene Expression Measurements.” Nature Biotechnology 33, no. 3 (February 9, 2015): 269–76. doi:10.1038/nbt.3154.
Osawa, Mitsujiro, Tomoyuki Yamaguchi, Yukio Nakamura, Shin Kaneko, Masafumi Onodera, Ken-Ichi Sawada, Armin Jegalian, Hong Wu, Hiromitsu Nakauchi, and Atsushi Iwama. “Erythroid Expansion Mediated by the Gfi-1B Zinc Finger Protein: Role in Normal Hematopoiesis.” Blood 100, no. 8 (October 15, 2002): 2769–77. doi:10.1182/blood-2002-01-0182.
Tibshirani, Robert, Guenther Walther, and Trevor Hastie. “Estimating the Number of Clusters in a Data Set via the Gap Statistic.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63, no. 2 (May 2001): 411–23. doi:10.1111/1467-9868.00293.